{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5b7c0298",
   "metadata": {},
   "source": [
    "# Part D: 自适应提升法\n",
    "\n",
    "## 1. Adaboost概述\n",
    "\n",
    "Adaboost的全称是Adaptive Boosting，其含义为自适应提升算法。其中，自适应是指Adaboost会根据本轮样本的误差结果来分配下一轮模型训练时样本在模型中的相对权重，即对错误的或偏差大的样本适度“重视”，对正确的或偏差小的样本适度“放松”，这里的“重视”和“放松”具体体现在了Adaboost的损失函数设计以及样本权重的更新策略。本课我们将介绍Adaboost处理分类和回归任务的算法原理，包括SAMME算法、SAMME.R算法和Adaboost.R2算法。\n",
    "\n",
    "## 2. 分类损失\n",
    "\n",
    "对于$K$分类问题而言，当样本标签$\\mathbf{y}=[y_1,...,y_K]^T$的类别$c$为第$k$类时，记\n",
    "\n",
    "$$\n",
    "y_k=\\left\\{\n",
    "\\begin{aligned}\n",
    "&1,\\quad &{\\rm if}\\ c=k\\\\\n",
    "&-\\frac{1}{k-1},\\quad &{\\rm if}\\ c\\neq k\n",
    "\\end{aligned}\n",
    "\\right.\n",
    "$$\n",
    "\n",
    "设模型的输出结果为$\\mathbf{f}=[f_1,...,f_K]^T$，则记损失函数为\n",
    "\n",
    "$$\n",
    "L(\\mathbf{y},\\mathbf{f})=\\exp(-\\frac{\\mathbf{y}^T\\mathbf{f}}{K})\n",
    "$$\n",
    "\n",
    "由于对任意的向量$a\\mathbb{1}$有\n",
    "\n",
    "$$\n",
    "L(\\mathbf{y}, \\mathbf{f}+a\\mathbb{1})=\\exp(-\\frac{\\mathbf{y}^T\\mathbf{f}}{K}-\\frac{a\\mathbf{y}^T\\mathbb{1}}{K})=\\exp(-\\frac{\\mathbf{y}^T\\mathbf{f}}{K})=L(\\mathbf{y}, \\mathbf{f})\n",
    "$$\n",
    "\n",
    "因此为了保证$\\mathbf{f}$的可估性，我们需要作出约束假设，此处选择对称约束条件\n",
    "\n",
    "$$\n",
    "f_1+f_2+...+f_K=0\n",
    "$$\n",
    "\n",
    "从概率角度而言，一个设计良好的分类问题损失函数应当保证模型在期望损失达到最小时的输出结果是使得后验概率$P(c\\vert \\mathbf{x})$达到最大的类别，这个条件被称为贝叶斯最优决策条件。在本问题下，满足对称约束条件的损失函数期望损失$\\mathbb{E}_{\\mathbf{Y}\\vert\\mathbf{x}}L(\\mathbf{Y},f)$达到最小时，由拉格朗日乘子法可解得模型输出为\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "k^* & =\\mathop{\\arg\\max}_kf_k^*(\\mathbf{x})\\\\\n",
    "& =\\mathop{\\arg\\max}_k (K-1)[\\log P(c=k\\vert \\mathbf{x})-\\frac{1}{K}\\sum_{i=1}^K\\log P(c=i\\vert \\mathbf{x})] \\\\\n",
    "& =\\mathop{\\arg\\max}_k P(c=k\\vert \\mathbf{x})\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "因此，选择指数损失能够满足贝叶斯最优决策条件。\n",
    "\n",
    "## 3. SAMME\n",
    "\n",
    "SAMME算法的全称是\\textbf{S}tagewise \\textbf{A}dditive \\textbf{M}odeling using a \\textbf{M}ulti-class \\textbf{E}xponential loss function，它假定模型的总输出$\\mathbf{f}$具有$\\mathbf{f}^{(M)}(\\mathbf{x})=\\sum_{m=1}^M \\beta^{(m)} \\mathbf{b}^{(m)}(\\mathbf{x})$的形式。其中，$M$是模型的总迭代轮数，$\\beta^{(m)}\\in \\mathbb{R^+}$是每轮模型的加权系数，$\\mathbf{b}^{(m)}(\\mathbf{x}) \\in\\mathbb{R}^K$是基模型$G$输出类别的标签向量。设样本的标签类别为$k$，当基模型预测的样本类别结果为$k'$时，记\n",
    "\n",
    "$$\n",
    "b^{(m)}_{k'}=\\left\\{\n",
    "\\begin{aligned}\n",
    "&1,\\quad &{\\rm if}\\ k'=k\\\\\n",
    "&-\\frac{1}{k-1},\\quad &{\\rm if}\\ k'\\neq k\n",
    "\\end{aligned}\n",
    "\\right.\n",
    "$$\n",
    "\n",
    "对于第$m$轮迭代而言，上一轮的模型输出为$\\mathbf{f}^{(m-1)}(\\mathbf{x})$，本轮需要优化得到的$\\beta^{*(m)}$和$\\mathbf{b}^{*(m)}$满足\n",
    "\n",
    "$$\n",
    "(\\beta^{*(m)}, \\mathbf{b}^{*(m)})=\\mathop{\\arg\\min}_{\\beta^{(m)}, \\mathbf{b}^{(m)}}\\sum_{i=1}^n L(\\mathbf{y}_i, \\mathbf{f}^{(m-1)}(\\mathbf{x}_i)+\\beta^{(m)}\\mathbf{b}^{(m)}(\\mathbf{x}_i))\n",
    "$$\n",
    "\n",
    "由于$\\mathbf{f}^{(m-1)}(\\mathbf{x}_i)$在第$m$轮为常数，记\n",
    "\n",
    "$$\n",
    "w_i=\\exp(-\\frac{1}{K}\\mathbf{y}_i^T\\mathbf{f}^{(m-1)}(\\mathbf{x}_i))\n",
    "$$\n",
    "\n",
    "此时有\n",
    "\n",
    "$$\n",
    "(\\beta^{*(m)}, \\mathbf{b}^{*(m)})=\\mathop{\\arg\\min}_{\\beta^{(m)}, \\mathbf{b}^{(m)}}\\sum_{i=1}^n w_i\\exp(-\\frac{1}{K}\\beta^{(m)}\\mathbf{y}_i^T\\mathbf{b}^{(m)}(\\mathbf{x}_i))\n",
    "$$\n",
    "\n",
    "设当轮预测正确的样本索引集合为$T$，则损失可表示为\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\tilde{L}(\\beta^{(m)}, \\mathbf{b}^{(m)})&=\\sum_{i=1}^n w_i\\exp(-\\frac{1}{K}\\beta^{(m)}\\mathbf{y}_i^T\\mathbf{b}^{(m)}(\\mathbf{x}_i)) \\\\\n",
    "&= \\sum_{i\\in T}w_i\\exp[-\\frac{\\beta^{m}}{K-1}]+\\sum_{i \\notin T}w_i\\exp[\\frac{\\beta^{(m)}}{(K-1)^2}] \\\\\n",
    "&= \\sum_{i\\in T}w_i\\exp[-\\frac{\\beta^{m}}{K-1}] +\\sum_{i\\notin T}w_i\\exp[-\\frac{\\beta^{m}}{K-1}] \\\\\n",
    "&\\qquad - \\sum_{i\\notin T}w_i\\exp[-\\frac{\\beta^{m}}{K-1}] +\\sum_{i \\notin T}w_i\\exp[\\frac{\\beta^{(m)}}{(K-1)^2}]\\\\\n",
    "&=\\exp[-\\frac{\\beta^{(m)}}{K-1}]\\sum_{i=1}^nw_i + \\{ \\exp[\\frac{\\beta^{(m)}}{(K-1)^2}]-\\exp[-\\frac{\\beta^{(m)}}{K-1}] \\}\\sum_{i=1}^nw_i\\mathbb{I}_{\\{i\\notin T\\}}\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "注意到$\\mathbf{b}^{(m)}$仅与$\\sum_{i=1}w_i\\mathbb{I}_{\\{i\\notin T\\}}$有关（因为基学习器的好坏控制了样本是否能够正确预测），且此项前的系数非负（因为$\\beta^{(m)}$非负），因此得到\n",
    "\n",
    "$$\n",
    "\\mathbf{b}^{*(m)}=\\mathop{\\arg\\min}_{\\mathbf{b}^{(m)}}\\sum_{i=1}^n w_i\\mathbb{I}_{\\{i\\notin T\\}}\n",
    "$$\n",
    "\n",
    "在得到$\\mathbf{b}^{*(m)}$后，通过求$\\tilde{L}$关于$\\beta^{(m)}$的导数并令之为0可解得\n",
    "\n",
    "$$\n",
    "\\beta^{*(m)}=\\frac{(K-1)^2}{K}[\\log\\frac{1-err^{(m)}}{err^{(m)}}+\\log(K-1)]\n",
    "$$\n",
    "\n",
    "其中，样本的加权错误率为\n",
    "\n",
    "$$\n",
    "err^{(m)}=\\sum_{i=1}^n\\frac{w_i}{\\sum_{i=1}^nw_i}\\mathbb{I}_{\\{i\\notin T\\}}\n",
    "$$\n",
    "\n",
    "样本$\\mathbf{x}_i$在第$m$轮的预测类别为$k_i^*=\\mathop{\\arg\\max}_{k} \\mathbf{f}^{(m)}(\\mathbf{x}_i)$，其中\n",
    "\n",
    "$$\n",
    "\\mathbf{f}^{(m)}(\\mathbf{x}_i)=\\mathbf{f}^{(m-1)}(\\mathbf{x}_i)+\\beta^{*(m)}\\mathbf{b}^{*(m)}(\\mathbf{x}_i)\n",
    "$$\n",
    "\n",
    "```{figure} ../_static/ada_algo1.png\n",
    "---\n",
    "width: 440px\n",
    "align: left\n",
    "---\n",
    "```\n",
    "\n",
    "事实上，我们还能通过一些多分类的性质来改写算法的局部实现，使得一些变量前的系数得到简化。记\n",
    "\n",
    "$$\n",
    "\\alpha^{*(m)}=\\log\\frac{1-err^{(m)}}{err^{(m)}}+\\log(K-1)\n",
    "$$\n",
    "\n",
    "此时，$w_i$每轮会被更新为\n",
    "\n",
    "$$\n",
    "\\tilde{w}_i = w_i\\cdot\\exp[\\frac{1-K}{K}\\alpha^{*(m)}]\\exp(\\alpha^{*(m)}\\mathbb{1}_{\\{i\\notin T\\}})\n",
    "$$\n",
    "\n",
    "对$\\mathbf{w}$进行归一化操作后，不会对下一轮算法1中$G^*$和$err^{(m)}$的结果产生任何影响。同时，如果把算法1第12行的$\\beta^{*(m)}$替换为$\\alpha^{*(m)}$，由于它们的输出结果只相差常数倍$\\frac{(K-1)^2}{K}$，因此最后的预测结果$c(\\mathbf{x})$也不会产生任何变化。\n",
    "\n",
    "由于$\\exp[\\frac{1-K}{K}\\alpha^{*(m)}]$是样本公共项，故我们可以每次都利用\n",
    "\n",
    "$$\n",
    "\\tilde{w}_i = w_i\\cdot\\exp(\\alpha^{*(m)}\\mathbb{1}_{\\{i\\notin T\\}})\n",
    "$$\n",
    "\n",
    "来更新，而不影响归一化结果。\n",
    "\n",
    "此时，算法1的迭代循环可进行如下重写：\n",
    "\n",
    "```{figure} ../_static/ada_algo2.png\n",
    "---\n",
    "width: 380px\n",
    "align: left\n",
    "---\n",
    "```\n",
    "\n",
    "## 4. SAMME.R\n",
    "\n",
    "许多分类器都能够输出预测样本所属某一类别的概率，但是SAMME算法只能利用分类的标签信息，而不能利用这样的概率信息。SAMME.R算法通过损失近似的思想，将加权分类模型的概率输出信息与boosting方法相结合。SAMME.R中的字母“R”代表“Real”，意味着模型每轮迭代的输出为实数。\n",
    "\n",
    "不同于SAMME在第$m$轮需要同时考虑得到最优的$\\beta^{(m)}$和$\\mathbf{b}^{(m)}$，SAMME.R将其统一为$\\mathbf{h}^{(m)}\\in \\mathbb{R}^K$，它需要满足对称约束条件$\\sum_{i=1}^K h_k=0$以保证可估性。此时，损失函数为\n",
    "\n",
    "$$\n",
    "L(\\mathbf{h}^{(m)})=\\exp[-\\frac{1}{K}\\mathbf{y}^T(\\mathbf{f}^{(m-1)}(\\mathbf{x})+\\mathbf{h}^{(m)}(\\mathbf{x}))]\n",
    "$$\n",
    "\n",
    "为了与概率联系，我们需对损失$L$的后验概率进行最小化，即\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\mathbf{h}^{*(m)}&=\\mathop{\\arg\\min}_{\\mathbf{h}^{(m)}} \\mathbb{E} [L\\vert \\mathbf{x}]\\\\\n",
    "&=\\mathop{\\arg\\min}_{\\mathbf{h}^{(m)}} \\mathbb{E}_{\\mathbf{y}}[ \\exp[-\\frac{1}{K}\\mathbf{y}^T(\\mathbf{f}^{(m-1)}(\\mathbf{x})+\\mathbf{h}^{(m)}(\\mathbf{x}))]\\vert \\mathbf{x}]\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "设样本$\\mathbf{y}$对应的标签为$S(\\mathbf{y})$，则\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\mathbb{E} [L\\vert \\mathbf{x}] &= \\mathbb{E}_{\\mathbf{y}}[ \\exp[-\\frac{1}{K}\\mathbf{y}^T\\mathbf{f}^{(m-1)}(\\mathbf{x})]\n",
    "\\exp[-\\frac{1}{K}\\mathbf{y}^T\\mathbf{h}^{(m)}(\\mathbf{x})]]\\vert \\mathbf{x}]\\\\\n",
    "&=\\sum_{k=1}^K\\left. [\\exp[-\\frac{1}{K}\\mathbf{y}^T\\mathbf{f}^{(m-1)}(\\mathbf{x})]\\exp[-\\frac{1}{K}\\mathbf{y}^T\\mathbf{h}^{(m)}(\\mathbf{x})]]\\right|_{S(\\mathbf{y})=k}P(S(\\mathbf{y})=k\\vert \\mathbf{x})\\\\\n",
    "&=\\sum_{k=1}^K \\left.[\\exp[-\\frac{1}{K}\\mathbf{y}^T\\mathbf{f}^{(m-1)}(\\mathbf{x})]\\right|_{S(\\mathbf{y})=k}P(S(\\mathbf{y})=k\\vert \\mathbf{x})]\\exp(-\\frac{h^{(m)}_k(\\mathbf{x})}{K-1})\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "记$w=\\exp[-\\frac{1}{K}\\mathbf{y}^T\\mathbf{f}^{(m-1)}(\\mathbf{x})]$，则\n",
    "\n",
    "$$\n",
    "\\mathbb{E} [L\\vert \\mathbf{x}] = \\sum_{k=1}^K \\left.w\\right|_{S(\\mathbf{y})=k}\\cdot P(S(\\mathbf{y})=k)\\exp(-\\frac{h^{(m)}_k(\\mathbf{x})}{K-1})\n",
    "$$\n",
    "\n",
    "不难发现对于样本$\\mathbf{y}$而言，越大的$w$意味着上一轮的模型结果越糟糕，此时负责预测$P(S(\\mathbf{y})=k)$的基模型就要加大对该样本的重视程度以获得较小的损失。\n",
    "\n",
    "但是，此时基模型本身是不带权重的，SAMME.R采用的近似方法是，考虑以$w$为权重的基模型$G$，用其输出$P_w(s(\\mathbf{y})=k\\vert \\mathbf{x})$的概率值来代替$\\left.w\\right|_{S(\\mathbf{y})=k}\\cdot P(S(\\mathbf{y})=k\\vert \\mathbf{x})$，这种行为合法的原因在于权重对于总体损失的惩罚方向是一致的，$G$通过权重$w$将原本作用于$L$的损失近似地“分配”给了基分类器的损失。\n",
    "\n",
    "此时，损失函数近似为\n",
    "\n",
    "$$\n",
    "\\mathbb{E} [L\\vert \\mathbf{x}] = \\sum_{k=1}^K P_w(s(\\mathbf{y})=k\\vert \\mathbf{x})\\exp(-\\frac{h^{(m)}_k(\\mathbf{x})}{K-1})\n",
    "$$\n",
    "\n",
    "由对称约束条件，结合拉格朗日乘子法可得\n",
    "\n",
    "$$\n",
    "h^{*(m)}_{k'}=(K-1)[\\log P_w(S(\\mathbf{y})=k'\\vert \\mathbf{x})-\\frac{1}{K}\\sum_{k=1}^K\\log P(S(\\mathbf{y})=k\\vert \\mathbf{x})]\n",
    "$$\n",
    "\n",
    "```{figure} ../_static/ada_algo3.png\n",
    "---\n",
    "width: 580px\n",
    "align: left\n",
    "---\n",
    "```\n",
    "\n",
    "## 5. Adaboost.R2\n",
    "\n",
    "利用权重重分配的思想，Adaboost还可以应用于处理回归问题。其中，Adaboost.R2算法是一种最常使用的实现。\n",
    "\n",
    "设训练集特征和目标分别为$\\mathbf{X}=(\\mathbf{x}_1, ..., \\mathbf{x}_n)$和$\\mathbf{y}=(y_1,...,y_n)$，权重$\\mathbf{w}$初始化为$(w_1,...,w_n)$。在第$m$轮时，根据权重训练基预测器得到$G^*$，计算每个样本的相对误差\n",
    "\n",
    "$$\n",
    "e_{i}=\\frac{\\vert y_i-G^*(\\mathbf{x}_i)\\vert}{\\max_i \\vert y_i-G^*(\\mathbf{x}_i)\\vert}\n",
    "$$\n",
    "\n",
    "设样本的加权相对误差率为$E^{(m)}=\\sum_{i=1}^n w_ie_i$，则相对误差率与正确率的比值为$\\beta^{(m)}=\\frac{E^{(m)}}{1-E^{(m)}}$，即预测器权重$\\alpha^{(m)}=\\log \\frac{1}{\\beta^{(m)}}$。\n",
    "\n",
    "更新权重$w_i$为$w_{i}[\\alpha^{(m)}]^{1-e_{i}}$，权重在归一化后进入下一轮训练，由此可如下写出训练算法：\n",
    "\n",
    "```{figure} ../_static/ada_algo4.png\n",
    "---\n",
    "width: 380px\n",
    "align: left\n",
    "---\n",
    "```\n",
    "\n",
    "在预测阶段，Adaboost.R2使用的是加权中位数算法。设每个基模型对某一个新测试样本的预测输出为$y_1,...,y_M$，基模型对应的预测器权重为$\\alpha^{(1)},...,\\alpha^{(M)}$，则Adaboost.R2的输出值为\n",
    "\n",
    "$$\n",
    "y=\\inf \\{ y\\big| \\sum_{m\\in \\{m\\vert y_m\\leq y\\}}\\alpha^{(m)} \\geq 0.5 \\sum_{m=1}^M\\alpha^{(m)}\\}\n",
    "$$\n",
    "\n",
    "## 习题\n",
    "\n",
    "### A组\n",
    "\n",
    "1. 二分类问题下，Adaboost算法如何调节样本的权重？\n",
    "2. 样本A在当轮分类错误，且样本B在当轮分类正确，请问在权重调整后，样本A的权重一定大于样本B吗？\n",
    "3. 在处理分类问题时，Adaboost的损失函数是什么？请叙述其设计的合理性。\n",
    "4. Adaboost如何处理回归问题？\n",
    "5. 用已经训练完毕的Adaboost分类模型和回归的模型来预测新样本的标签，请分别具体描述样本从输入到标签输出的流程。\n",
    "\n",
    "### B组\n",
    "\n",
    "1. 假设有一个3分类问题，标签类别为第2类，模型输出的类别标签为[-0.1,-0.3,0.4]，请计算对应的指数损失。\n",
    "2. 什么是贝叶斯最优决策条件？请针对Adaboost的分类问题，给出一个不符合贝叶斯最优决策条件的损失函数。\n",
    "3. 对于二分类问题，请完成以下任务：\n",
    "    - 对公式进行化简，写出$K=2$时的SAMME算法流程，并与李航《统计学习方法》一书中所述的Adaboost二分类算法对比是否一致。\n",
    "    - 在sklearn源码中找出算法流程中每一行对应的处理代码。\n",
    "4. 算法2第12行中给出了$\\mathbf{f}$输出的迭代方案，但在sklearn包的实现中使用了$\\mathbb{I}_{\\{G^*(\\mathbf{x})=S(\\mathbf{y})\\}}$来代替$\\mathbf{b}^{*(m)}(\\mathbf{x})$。请根据本文的实现，对sklearn包的源码进行修改并构造一个例子来比较它们的输出是否会不同。（提示：修改AdaboostClassifier类中的decision\\_function函数和staged\\_decision\\_function函数）\n",
    "5. 证明SAMME.R算法中$h^*_{k'}$的求解结果。\n",
    "6. 算法3的第14行给出了$w_i$的更新策略，请说明其合理性。\n",
    "7. 请实现SAMME、SAMME.R和Adaboost.R2算法。\n",
    "8. 请结合加权中位数的定义解决以下问题：\n",
    "\\begin{itemize}\n",
    "    - 当满足什么条件时，Adaboost.R2的输出结果恰为每个基预测器输出值的中位数？\n",
    "    - Adaboost.R2模型对测试样本的预测输出值是否一定会属于$M$个分类器中的一个输出结果？若是请说明理由，若不一定请给出反例。\n",
    "    - 设$k\\in \\{y_1,...,y_M\\}$，记$k$两侧（即大于或小于$k$）的样本集合对应的权重集合为$W^+$和$W^-$，请证明使得这两个集合的元素之和差值最小的$k$就是Adaboost.R2的输出值$y$。\n",
    "    - 设$\n",
    "y'=\\sup \\{ y'\\big| \\sum_{m\\in \\{m\\vert y_m\\leq y'\\}}\\alpha^{(m)} \\leq 0.5 \\sum_{m=1}^M\\alpha^{(m)}\\}\n",
    "$，它和定义中给出的$y$值一定相等吗？若不一定请举出反例。\n",
    "    - 相对于普通中位数，加权中位数的输出结果鲁棒性更强，请结合公式说明理由。"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "text_representation": {
    "format_name": "myst"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "source_map": [
   8
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}